Guessers for Finite-State Transducer Lexicons
Abstract
Language software applications encounter new words, e.g., acronyms, technical terminology, names, or compounds of such words. To add new words to a lexicon, we need to indicate their inflectional paradigm. We present a new, generally applicable method for creating an entry generator, i.e., a paradigm guesser, for finite-state transducer lexicons. As a guesser tends to produce numerous suggestions, it is important that the correct suggestions be among the first few candidates. We prove some formal properties of the method and evaluate it on Finnish, English, and Swedish full-scale transducer lexicons. We use the open-source Helsinki Finite-State Technology [1] to create finite-state transducer lexicons from existing lexical resources and automatically derive guessers for unknown words. The method has a recall of 82-87% and a precision of 71-76% for the three test languages. The model needs no external corpus and can therefore serve as a baseline.
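To make the idea of a ranked paradigm guesser concrete, here is a minimal sketch in plain Python. It is not the transducer-based method of the paper and does not use the Helsinki Finite-State Technology API. It only assumes that a paradigm can be guessed from word-final character sequences seen in an existing lexicon, with longer matching suffixes ranked first. The toy lexicon, the paradigm labels (V_regular, N_s, N_y_ies), and the function names are all hypothetical.

```python
from collections import Counter, defaultdict


def build_suffix_index(lexicon, max_suffix=4):
    """Count how often each word-final suffix co-occurs with each paradigm."""
    index = defaultdict(Counter)
    for word, paradigm in lexicon:
        for n in range(1, min(max_suffix, len(word)) + 1):
            index[word[-n:]][paradigm] += 1
    return index


def guess_paradigms(word, index, max_suffix=4):
    """Rank candidate paradigms for an unknown word.

    Candidates from the longest matching suffix come first; within one
    suffix length, more frequent paradigms come first.
    """
    ranked = []
    for n in range(min(max_suffix, len(word)), 0, -1):
        counts = index.get(word[-n:], Counter())
        for paradigm, _count in counts.most_common():
            if paradigm not in ranked:
                ranked.append(paradigm)
    return ranked


# Hypothetical toy lexicon of (word, paradigm label) pairs.
TOY_LEXICON = [
    ("walk", "V_regular"),
    ("talk", "V_regular"),
    ("chalk", "N_s"),
    ("city", "N_y_ies"),
    ("party", "N_y_ies"),
    ("boy", "N_s"),
    ("dog", "N_s"),
]

index = build_suffix_index(TOY_LEXICON)
candidates = guess_paradigms("balk", index)
```

With this toy data, the unknown word "balk" shares the suffix "alk" with two V_regular entries and one N_s entry, so V_regular is ranked first. Ranking matters for the reason the abstract gives: a guesser produces several suggestions, and the correct one should appear near the top.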
Similar resources
Scaling an Irish FST Morphology Engine for Use on Unrestricted Text
This paper details the steps involved in scaling up a lexicalised finite-state morphology transducer for use on unrestricted text. Our starting point was a baseline inflectional morphology engine [1], with 81% token coverage measured against a 15-million-word corpus of Irish texts [2]. Manually scaling the FST lexicon component of a morphology transducer is time-consuming, expensive and rarely...
A generalized composition algorithm for weighted finite-state transducers
This paper describes a weighted finite-state transducer composition algorithm that generalizes the concept of the composition filter and presents filters that remove useless epsilon paths and push forward labels and weights along epsilon paths. This filtering permits the composition of large speech recognition context-dependent lexicons and language models much more efficiently in time and space ...
JCLext: A Java Tool for Compiling Finite-State Transducers from Full-Form Lexicons
JCLexT is a compiler of finite-state transducers from full-form lexicons; this tool seems to be the first Java implementation of such functionality. A comparison between JCLexT and Foma was performed based on extensive data from Portuguese. The main disadvantage of JCLexT is its slower compilation time compared to Foma. However, this is negated by the fact that a large transducer compiled...
Heuristic Hyper-minimization of Finite State Lexicons
Flag diacritics, which are special multi-character symbols executed at runtime, enable optimising finite-state networks by combining identical sub-graphs of their transition graphs. Traditionally, the feature has required linguists to devise the optimisations to the graph by hand alongside the morphological description. In this paper, we present a novel method for discovering flag positions in mor...
Finite-State Morphology of Estonian: Two-Levelness Extended
The paper concentrates on modeling Estonian morphology in the framework of the two-level morphology model. The result is a consistent description of Estonian morphology, which consists of a network of lexicons (root lexicons cover the 2500 most frequent word roots) and two-level rules. The main rule set contains 45 rules, which describe various stem changes. The subset of rules dealing with stem ...